Stochastic Language Models for Automatic Acquisition of Lexicons from Printed Bilingual Dictionaries

Authors

  • Song Mao
  • Tapas Kanungo
Abstract

Electronic bilingual lexicons are crucial for machine translation, cross-lingual information retrieval and speech recognition. For low-density languages, however, the availability of electronic bilingual lexicons is questionable. One solution is to acquire electronic lexicons from printed bilingual dictionaries. While manual data entry is a possibility, automatic acquisition of lexicons from scanned images of bilingual dictionaries would expedite the prototyping process of cross-language systems. Printed dictionaries have a logical model that defines the syntax of the dictionary entries, i.e., the order of the dictionary entry, its part of speech, its pronunciation and its definition. In this article we propose an algorithm to automatically extract bilingual dictionary entries based on stochastic language models. We demonstrate this algorithm on a printed Chinese-English dictionary. This work can readily be applied to extracting information from other tabular structures such as telephone books and catalogs.
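The abstract does not say which stochastic model the authors use, so the following is only an illustrative sketch of the general idea. It assumes a small hidden Markov model whose states are the entry fields named above and decodes a tokenized entry with the Viterbi algorithm. The state names, the token classes in `token_class`, and every probability are hypothetical values chosen for the example; none of them come from the paper.

```python
import math

# Hypothetical field labels, in the entry order named in the abstract.
STATES = ["HEADWORD", "POS", "PRON", "DEF"]

# Made-up probabilities for illustration; a real system would estimate them
# from labeled dictionary entries.
START = {"HEADWORD": 1.0, "POS": 0.0, "PRON": 0.0, "DEF": 0.0}
TRANS = {
    "HEADWORD": {"HEADWORD": 0.2, "POS": 0.6, "PRON": 0.2, "DEF": 0.0},
    "POS":      {"HEADWORD": 0.0, "POS": 0.0, "PRON": 0.7, "DEF": 0.3},
    "PRON":     {"HEADWORD": 0.0, "POS": 0.0, "PRON": 0.1, "DEF": 0.9},
    "DEF":      {"HEADWORD": 0.0, "POS": 0.0, "PRON": 0.0, "DEF": 1.0},
}
EMIT = {
    "HEADWORD": {"cjk": 0.90, "latin": 0.10, "pos_tag": 0.00, "bracketed": 0.00},
    "POS":      {"cjk": 0.00, "latin": 0.10, "pos_tag": 0.90, "bracketed": 0.00},
    "PRON":     {"cjk": 0.00, "latin": 0.10, "pos_tag": 0.00, "bracketed": 0.90},
    "DEF":      {"cjk": 0.05, "latin": 0.90, "pos_tag": 0.03, "bracketed": 0.02},
}


def token_class(token):
    """Map a token to a coarse observation class (a hypothetical feature set)."""
    if token in {"n.", "v.", "adj.", "adv."}:
        return "pos_tag"
    if token.startswith("[") and token.endswith("]"):
        return "bracketed"
    if any(ord(ch) > 0x2E7F for ch in token):
        return "cjk"
    return "latin"


def log(p):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tokens):
    """Return the most likely field label for each token of one entry."""
    if not tokens:
        return []
    obs = [token_class(t) for t in tokens]
    # Each row maps state -> (best log-probability so far, previous state).
    rows = [{s: (log(START[s]) + log(EMIT[s][obs[0]]), None) for s in STATES}]
    for o in obs[1:]:
        prev = rows[-1]
        row = {}
        for s in STATES:
            score, arg = max((prev[p][0] + log(TRANS[p][s]), p) for p in STATES)
            row[s] = (score + log(EMIT[s][o]), arg)
        rows.append(row)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: rows[-1][s][0])
    path = [state]
    for row in reversed(rows[1:]):
        state = row[state][1]
        path.append(state)
    return list(reversed(path))
```

Calling `viterbi` on a tokenized entry returns one field label per token. Consecutive tokens with the same label can then be merged into the fields of a lexicon record.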


Similar resources

Acquisition of Bilingual MT Lexicons from OCRed Dictionaries

This paper describes an approach to analyzing the lexical structure of OCRed bilingual dictionaries to construct resources suited for machine translation of low-density languages, where online resources are limited. A rule-based, an HMM-based, and a post-processed HMM-based method are used for rapid construction of MT lexicons based on systematic structural clues provided in the original dictio...


Towards Semi Automatic Construction of a Lexical Ontology for Persian

Lexical ontologies and semantic lexicons are important resources in natural language processing. They are used in various tasks and applications, especially where semantic processing is involved, such as question answering, machine translation, text understanding, information retrieval and extraction, content management, text summarization, knowledge acquisition and semantic search engines. Altho...


Semi-Automatic Acquisition of Domain-Specific Translation Lexicons

We investigate the utility of an algorithm for translation lexicon acquisition (SABLE), used previously on a very large corpus to acquire general translation lexicons, when that algorithm is applied to a much smaller corpus to produce candidates for domain-specific translation lexicons. 1 Introduction: Reliable translation lexicons are useful in many applications, such as cross-langua...


Cross-Lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet

This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a sm...


Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization

Cross-language Text Categorization is the task of assigning semantic classes to documents written in a target language (e.g. English) while the system is trained using labeled documents in a source language (e.g. Italian). In this work we present many solutions according to the availability of bilingual resources, and we show that it is possible to deal with the problem even when no such resour...



Publication date: 2001